Real-Time Video Upscaling With ESPCN

Why ESPCN

Old video has a certain charm. Old video upscaled with bicubic interpolation does not: that blurry, vaseline-smeared quality where every edge dissolves into its neighbor. I had a collection of low-resolution clips I wanted to watch on a modern display, and bicubic made them look worse.

Most super-resolution networks upscale the image first, then run heavy convolutions at high resolution. ESPCN (Shi et al., CVPR 2016) flips this: all computation happens at the original low resolution, and upscaling happens only at the very end via sub-pixel convolution. This makes it fast enough for real-time video on a consumer GPU.

I implemented it from scratch in PyTorch.

Architecture

Four convolutional layers:

layer1: Conv2d(3, 64, kernel=5, padding=2) + LeakyReLU
layer2: Conv2d(64, 64, kernel=3, padding=1) + LeakyReLU
layer3: Conv2d(64, 32, kernel=3, padding=1) + LeakyReLU
layer4: Conv2d(32, n²×3, kernel=3, padding=1) + Sigmoid
upscale: PixelShuffle(n)
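
The layer list above maps almost line for line onto a PyTorch module. What follows is a minimal sketch rather than the original training code: the class name ESPCN, the scale and channels arguments, and the default LeakyReLU slope (PyTorch's 0.01) are assumptions, since the list does not specify them.

```python
import torch
import torch.nn as nn


class ESPCN(nn.Module):
    """Sketch of the four-layer network described above (names are illustrative)."""

    def __init__(self, scale: int = 2, channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2),
            nn.LeakyReLU(),  # assumption: default negative_slope=0.01
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.LeakyReLU(),
            # n^2 output channels per color channel, still at low resolution
            nn.Conv2d(32, channels * scale**2, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Rearranges (C * n^2, H, W) into (C, H * n, W * n)
        self.upscale = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.upscale(self.body(x))


if __name__ == "__main__":
    model = ESPCN(scale=2)
    frame = torch.rand(1, 3, 24, 32)  # one low-resolution RGB frame in [0, 1]
    with torch.no_grad():
        print(model(frame).shape)
```

Every padding value is chosen so the convolutions preserve height and width, which means the only change in spatial size comes from the final PixelShuffle.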

The trick is in the last layer: it produces n² channels for each color channel (3n² in total for RGB), and PixelShuffle rearranges those channels into an n×n block of spatial pixels. The network is effectively predicting a tiny grid of sub-pixels for every input pixel. By deferring the upscale to this final step, all the heavy convolution work stays at low resolution. A convolution's cost grows with the number of pixels it runs over, and the target resolution has n² times as many pixels, so the speedup over architectures that run the same layers at target resolution is roughly n².
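
To make the rearrangement concrete, here is a small pure-Python sketch of the channel-to-space mapping. It follows the channel ordering that torch.nn.PixelShuffle documents; the function name pixel_shuffle and the nested-list representation are just for illustration and are far slower than the real operator.

```python
def pixel_shuffle(x, r):
    """Rearrange nested lists shaped [C*r*r][H][W] into [C][H*r][W*r]."""
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for row in range(h):
            for col in range(w):
                # Each input pixel owns an r x r block in the output.
                for i in range(r):
                    for j in range(r):
                        src_channel = c * r * r + i * r + j
                        out[c][row * r + i][col * r + j] = x[src_channel][row][col]
    return out


# One input pixel with four channels becomes one 2x2 output block.
low_res = [[[0]], [[1]], [[2]], [[3]]]
print(pixel_shuffle(low_res, 2))  # [[[0, 1], [2, 3]]]
```

No arithmetic happens in this step; it is a pure reindexing, which is why the upscale itself is nearly free.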

Training used MSE loss with Adam and a MultiStepLR scheduler dropping the learning rate at epochs 30 and 80. Each epoch sampled 500 random 240×240 crops, batch size 50. Nothing exotic; the architecture does the heavy lifting.
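
A training loop along these lines might look like the sketch below. Only the loss, the optimizer, the scheduler milestones (30 and 80), and the 500 crops per epoch at batch size 50 come from the description above. The initial learning rate, the decay factor gamma, the total epoch count, the bicubic downscaling used to make low-resolution inputs, and the hypothetical sample_hr_crops callable are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_pair(hr: torch.Tensor, scale: int):
    """Downscale high-res crops to build (low-res input, high-res target) pairs."""
    h, w = hr.shape[-2:]
    lr = F.interpolate(hr, size=(h // scale, w // scale),
                       mode="bicubic", align_corners=False)
    # Bicubic can overshoot slightly, so clamp back into the valid pixel range.
    return lr.clamp(0.0, 1.0), hr


def train(model, sample_hr_crops, scale=2, epochs=100,
          steps_per_epoch=500 // 50, lr=1e-3):
    """sample_hr_crops() is a hypothetical callable returning a (B, 3, H, W) batch in [0, 1]."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr is an assumption
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 80], gamma=0.1)  # gamma is an assumption
    criterion = nn.MSELoss()
    history = []
    for _ in range(epochs):
        model.train()
        running = 0.0
        for _ in range(steps_per_epoch):
            low, high = make_pair(sample_hr_crops(), scale)
            optimizer.zero_grad()
            loss = criterion(model(low), high)
            loss.backward()
            optimizer.step()
            running += loss.item()
        scheduler.step()  # one scheduler step per epoch, so milestones are in epochs
        history.append(running / steps_per_epoch)
    return history
```

With 500 crops and a batch size of 50, each epoch is just ten optimizer steps, so the milestones at 30 and 80 fall after 300 and 800 updates respectively.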

Green Tint

My first implementation followed the original paper: convert to YCbCr, super-resolve only the Y (luminance) channel, upscale Cb and Cr with bicubic. This is standard in the super-resolution literature because human vision is more sensitive to luminance detail than chrominance.

Except the output had a persistent green color shift.

It wasn't dramatic; on a single frame you might not notice it. But sustained across a whole video it read as a sickly tint that made skin tones look wrong and skies look alien. My diagnosis was rounding error in the YCbCr conversion: a small bias on each frame, repeated on every one of thousands of frames.
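
One rough way to probe this kind of drift is to push colors through an 8-bit RGB to YCbCr to RGB round trip and look at the signed error per channel. The sketch below assumes full-range BT.601 (JPEG-style) coefficients, which may not match the conversion the original pipeline used, and it isolates the conversion only: it says nothing about what the network or the bicubic chroma upscale add on top.

```python
from itertools import product


def clip8(value: float) -> int:
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, round(value)))


def rgb_to_ycbcr(r, g, b):
    # Assumption: full-range BT.601 coefficients, as used by JPEG.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clip8(y), clip8(cb), clip8(cr)


def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clip8(r), clip8(g), clip8(b)


def round_trip_error(rgb):
    """Signed per-channel difference after RGB -> YCbCr -> RGB with 8-bit rounding."""
    back = ycbcr_to_rgb(*rgb_to_ycbcr(*rgb))
    return tuple(after - before for before, after in zip(rgb, back))


# Sample a coarse grid of colors and summarize the drift per channel.
levels = range(0, 256, 17)
errors = [round_trip_error(rgb) for rgb in product(levels, repeat=3)]
mean_error = [sum(e[ch] for e in errors) / len(errors) for ch in range(3)]
print("mean signed error (R, G, B):", mean_error)
print("worst absolute error:", max(abs(v) for e in errors for v in e))
```

A mean signed error that is consistently nonzero in one channel would point at the conversion; if it hovers around zero, the tint is more likely coming from a mismatch elsewhere, such as different coefficient sets or value ranges on the way in and out.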

The fix: train on all three RGB channels directly. More computation, slightly lower PSNR on benchmarks (the network splits capacity across three channels instead of focusing on luminance), but the color fidelity was dramatically better. For video, perceptual quality matters more than peak signal-to-noise ratio.

Dark Scenes

The second bug was subtler. The early version used ReLU after every layer, including the last, so the network's output was clipped at zero. Pixel values should land in [0, 1], and the network sometimes wanted to predict slightly negative values for dark regions. ReLU killed those, creating flat black patches where there should have been subtle shadow detail.

Switching to LeakyReLU for the hidden layers and a sigmoid on the output fixed it. LeakyReLU keeps a small nonzero gradient for negative inputs during training; sigmoid squashes the output smoothly into the [0, 1] range instead of hard-clipping it. A small code diff, a meaningful quality improvement in dark scenes.
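
The difference between the three activations is easy to see on a handful of values near zero. This is a plain-Python illustration, not the network code; the 0.01 slope is PyTorch's default for LeakyReLU and is an assumption about what the model used.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    # Negative inputs are scaled down instead of being zeroed out.
    return x if x >= 0 else negative_slope * x


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


for value in (-0.5, -0.05, 0.0, 0.05, 0.5):
    print(f"x={value:+.2f}  relu={relu(value):.4f}  "
          f"leaky={leaky_relu(value):+.4f}  sigmoid={sigmoid(value):.4f}")
```

ReLU maps every negative input to exactly zero, so two different dark values become indistinguishable and the gradient there is zero. LeakyReLU keeps them distinct, and sigmoid approaches 0 and 1 smoothly rather than cutting off at them.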

Results

Tested on Set5, Set14, BSD100, and Urban100. At 2× upscale, around 31.5 dB PSNR, competitive with the original paper and visibly sharper than bicubic on every test image. Urban100 was the hardest: regular patterns like window grids and brick textures create aliasing, and extreme repetitive patterns sometimes produced faint Moiré artifacts.
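
For reference, PSNR for images scaled to [0, 1] is 10·log10(1 / MSE). The sketch below computes it over flat sequences of pixel values and inverts it to show what a given dB figure means in terms of error. The function names are illustrative, and whether the benchmark numbers above were computed on RGB or on luminance only is not stated, so treat this as the generic definition rather than the exact evaluation script.

```python
import math


def mse(a, b) -> float:
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


def mse_for_psnr(db: float, max_value: float = 1.0) -> float:
    """Invert the PSNR formula: the MSE that corresponds to a given dB value."""
    return max_value ** 2 / (10 ** (db / 10))


reference = [0.5, 0.5, 0.5, 0.5]
estimate = [0.6, 0.6, 0.6, 0.6]
print(f"{psnr(reference, estimate):.2f} dB")
print(f"MSE at 31.5 dB: {mse_for_psnr(31.5):.2e}")
```

On this scale, 31.5 dB corresponds to a mean squared error of roughly 7.1e-4 for pixel values in [0, 1].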

The real test was video: edges crisp without ringing, textures with plausible detail, and no temporal flickering between frames. ESPCN processes each frame independently, but consistent training keeps the output stable enough that you don't notice.

[Figure: the low-resolution input. Bicubic downscaling loses fine detail and edges.]

[Figure: the ESPCN reconstruction, with sharper edges and recovered detail and without the smearing of bicubic upscaling.]

The model is compact enough to serve as a baseline whenever I explore more complex architectures. Those old videos finally look decent on a 4K screen.